Food Accessibility in U.S. Counties

Author

Tenzin Sherpa

The Issue

Maintaining a healthy and consistent diet accounts for our most basic and important needs. Access to food sources, however, may be limited for many people, leading to issues of food insecurity. In 2010, 14.5% (17.2 million households) in the U.S. were food insecure, 5.4% of which had very low food security (Coleman-Jensen et al. (2011)).

Many are left to ask: “Where’s my next meal going to come from?”

Goals

Accessibility to food can play an important role in determining whether households have a good and consistent supply of nutritious food. Food accessibility is associated with a person’s proximity to grocery stores, food markets, and other food sources. In this project, I analyze the level of food accessibility for households at the county level and explore how it is linked to various food environment factors, such as income, population demographics, the availability of food assistance programs, and food insecurity. Additionally, I look at the changes in food accessibility from 2010 to 2015, and investigate whether changes in other food environment factors have an impact on food access.

Data

For this project, I use the Food Environment Atlas for 9/10/2020. The dataset has been compiled by the US Department of Agriculture’s (USDA) Economic Research Service (ERS) to study factors that affect food choices and the accessibility of healthy foods in communities. The information in the dataset had been aggregated from various reports from the USDA, the Bureau of the Census, the U.S. Department of Commerce, and more for the years 2006 to 2019. Food accessibility population data, which is the focus of the project, were given for 2010 and 2015.

The dataset contains county-level information on food environment factors such as access to grocery stores/supermarkets/restaurants, local food sales, food prices, food assistance programs like SNAP (Supplemental Nutrition Assistance Program), National School Lunch Program, etc., socioeconomic characteristics, and some health/physical activities. State-level information on household food insecurity is also given. The dataset contained data on 3143 counties, each uniquely identified with their FIPS code.

To enable geographic visualizations and analysis, an additional file from the ArcGIS Hub containing geographic boundary shapes is used.

Preprocessing

The information in the dataset was stored in separate worksheets, grouped according to the category of the features. Since there was a large number of features, the first step was to perform feature reduction. This was done by manually selecting the features that were related to the project’s questions and goals. Features that contained very granular information were also omitted in order to simplify analysis as well as to obtain more general insights.

The dataset and boundary shape file were also compared to check if county FIPS codes and names matched for all observations. Some county names and FIPS codes had changed over the years, which caused discrepancies between the two datasets. These counties were identified and modified to reflect the changes and to ensure observations matched. This brought the number of observations from 3143 to 3142 counties.

For columns with a low number of missing values, imputation was done using the median of those features. This was done while looking at variable distributions to ensure that imputation did not significantly change their distributions. However, median imputation was not appropriate for some 2015 variables that had a larger number of missing values. For these variables, missing values were replaced with numbers from the corresponding feature for 2010.

Overall, the data preprocessing involved merging data, selecting relevant variables and imputing missing values while exploring the nature of the variables. The final cleaned dataset had 3142 rows and 100 columns. A few rows of the final dataset are shown below.

Code
import pandas as pd
df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')
df.head()
FIPS State County LACCESS_POP10 LACCESS_POP15 PCH_LACCESS_POP_10_15 PCT_LACCESS_POP10 PCT_LACCESS_POP15 LACCESS_LOWI10 LACCESS_LOWI15 ... PCT_HISP10 PCT_NHASIAN10 PCT_NHNA10 PCT_NHPI10 PCT_65OLDER10 PCT_18YOUNGER10 MEDHHINC15 POVRATE15 CHILDPOVRATE15 METRO13
0 01001 AL Autauga 18428.439685 17496.693038 -5.056026 33.769657 32.062255 5344.427472 6543.676824 ... 2.400542 0.855766 0.397647 0.040314 11.995382 26.777959 56580.0 12.7 18.8 1
1 01003 AL Baldwin 35210.814078 30561.264430 -13.204891 19.318473 16.767489 9952.144027 9886.831137 ... 4.384824 0.735193 0.628755 0.043343 16.771185 22.987408 52387.0 12.9 19.6 1
2 01005 AL Barbour 5722.305602 6069.523628 6.067799 20.840972 22.105560 3135.676086 2948.790251 ... 5.051535 0.389700 0.218524 0.087409 14.236807 21.906982 31433.0 32.0 45.2 0
3 01007 AL Bibb 1044.867327 969.378841 -7.224696 4.559753 4.230324 491.449066 596.162829 ... 1.771765 0.096007 0.279293 0.030548 12.681650 22.696923 40767.0 22.2 29.3 1
4 01009 AL Blount 1548.175559 3724.428242 140.568857 2.700840 6.497380 609.027708 1650.959482 ... 8.070200 0.200621 0.497191 0.031402 14.722096 24.608353 50487.0 14.7 22.2 1

5 rows × 100 columns

Analysis

Getting Started

To investigate the relationship between food accessibility and food environment factors, I utilized visualizations and regression models as the primary analysis methods. The analysis requires the following Python library packages:

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import json

# regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

Visualizations

Food Accessibility

For this project, low accessibility populations are defined to be those “living more than 1 mile from a supermarket or large grocery store if in an urban area, or more than 10 miles from a supermarket or large grocery store if in a rural area” (Economic Research Service (n.d.)). The dataset provides information on the percentage of the county population with low food accessibility. Figure 1 shows the distribution of this variable.

To read Figure 1

The variable here represents the percentage of the county population that has limited access to food due to living far from food stores. For instance, a value of 40% for a county would indicate that 40% of its population has poor access to food due to living far from stores.
Now, Figure 1 shows the distribution of low food access population percentages across different counties. Each bar represents the number of counties that have a certain range of low access population percentage. For instance, the first bar on the left indicates that there are around 350 counties where 0-5% of their population has limited access to food.
Moving from left to right on the horizontal axis, the level of food accessibility decreases, with the rightmost bar representing counties where almost 100% of their residents have poor access to food i.e. live far from food stores.

Code
df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')

sns.set(style='darkgrid')
def plt_dis(c):
    f = sns.displot(data=df, x=c, height=4, aspect=10/8.27, bins=20)
    plt.xlim([0, 100])
    plt.ylabel("Number of Counties", alpha=0.8)
    plt.xlabel("Low Food Access Pop. %", alpha=0.8)
    plt.show()

plt_dis('PCT_LACCESS_POP10')

Figure 1: Distribution of County % Population with Low Food Accessibility

The distribution plot reveals an intriguing characteristic of food accessibility in the U.S. Specifically, it illustrates that there are two distinct groups of counties based on the extent of food accessibility: those with low accessibility and those that are worse off with extremely low accessibility. The left peak in the graph represents the first group of counties, for which less than 40% of the populations have limited access to food. Majority of the counties fall under this group. However, the second peak reveals that there are also counties facing worse food accessibility issues. For approximately 130 counties, the entire population (up to 100%) has limited access to food sources.

Geographic Visualization

To gain a better spatial understanding of the location of low-accessibility counties in the U.S., I plotted the variable on a chloropleth map.

To read Figure 2

The map displays the percentage of the population with limited food access on a map. The color scale ranges from 0 to 100%, with the darkest shade representing 0% and the brightest shade representing 100%. Therefore, the bright yellow counties on the map indicate areas where almost 100% of the population has limited access to food due to living far from stores.

Yellow counties have bad food access and the blue ones have better access.

Code
df = pd.read_pickle(r'./../data/AtlasCleaned.pkl')

with open(r'./../src/USA_Counties_(Generalized).geojson') as f:
    tract = json.load(f)

def plot_chlpth(df, x, label, header):
  fig = px.choropleth(df, locations='FIPS', color=x, color_continuous_scale="Viridis",
                        range_color=(0, 100), geojson=tract, featureidkey = "properties.FIPS",
                        scope="usa", hover_data=['State', 'County'],
                        labels=label, title=header)
  fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1))
  fig.show()

plot_chlpth(df, 'PCT_LACCESS_POP10', {'PCT_LACCESS_POP10':'Popn. %'}, None)

Figure 2: Low Food Accessibility Population Percentages in 2010

Figure 2 shows that low food accessibility counties are concentrated in the Midwest, Southwest, and West regions of the country. This could be due to the remote and isolated geography of these areas, resulting in a scarcity of food stores. Conversely, urban and densely populated regions, such as the Eastern and coastal areas, tend to have better food access. Hence, geographical location appears to play a role in determining the level of food accessibility in a county, as it is defined based on the proximity of households to food stores.

Change in Food Accessibility 2010-2015

Next, I investigate how the population level of low food accessibility changed in counties from 2010 to 2015. To analyze this, I created a variable that accounts for the difference by subtracting the 2010 values from the 2015 values of the percentage of the county population with low food accessibility. This variable is then visualized using a county-level chloropleth map.

To read Figure 3

Change = 2015 value - 2010 value of Low Food Access Population %

Red Counties –> Food access gets worse in 2015 (Change > 0 because 2015 number was greater; a greater number means that more people did not have good food accessiblity)

Blue Counties –> Food access improves in 2015

White Counties –> Food access stays same

Code
df_diff = pd.DataFrame(df, columns = ['FIPS', 'State', 'County', 'PCT_LACCESS_POP10', 'PCT_LACCESS_POP15'])
df_diff['diff'] = df['PCT_LACCESS_POP15'] - df['PCT_LACCESS_POP10']

fig = px.choropleth(df_diff, locations='FIPS',  
        color='diff',
        color_continuous_scale="RdBu_r",
        range_color=(df_diff['diff'].min(), df_diff['diff'].max()),
        geojson=tract,
        featureidkey = "properties.FIPS",
        scope="usa", 
        hover_data=['State', 'County'],
        labels={'diff':'Change in %'})
fig.update_layout(legend=dict(orientation="h", yanchor="top",y=0.5,xanchor="right",x=1))
fig.show()

Figure 3: Change in Low Food Access Pop. % from 2010 to 2015

The chloropeth map shows that while there was no significant change in most counties, there were some counties that experienced remarkable improvements or deteriorations in food accessibility. These counties are denoted by the blue and red colors, respectively. Interestingly, these counties are concentrated in the Midwest and West regions of the country, similar to the location of low-access counties shown in the previous map.

Food and Vehicle Access

The dataset measures food accessibility based on the proximity of residential areas to food stores. However, it is important to note that the availability of vehicles can also play a significant role in determining food accessibility. While some populations may ive far from stores, having a car or access to transportation can help residents travel greater distances to purchase food.

To take account of this additional variable, we look at the density plot of household vehicle availability and county food accessibility. The blocks farther away from the axes represent the areas where both food accessibility and vehicle availability for the populations living there are low. These counties are likely to have the biggest challenges of food accessibility.

Code
def plt_dis(c):
    f = sns.displot(data=df, x=c, y='PCT_LACCESS_POP10', height=4, aspect=10/8.27, bins=10)
    plt.xlim([-5, 100])
    plt.ylabel("Low Food Access Pop. %", alpha=0.9)
    plt.xlabel("Household % with Low Food Access & No Vehicle", alpha=0.8)
    plt.show()

plt_dis('PCT_LACCESS_HHNV10')

Figure 4: Low Food Access Population and Household Vehicle Availability

In other counties, although food accessibility is bad, most households have access to vehicles (shown by the few dark areas). For most counties, 0-40% of residents have low food accessibility, but only less than 10% of households with low food access do not have vehicles. In other words, 90% of residents with low food accessibility do have access to a car.

Additionally, for some of the counties with bad food access, only a small proportion of households did not have access to cars (shown by top left boxes). This could mean that the problem of food accessibility in these counties may be lower than depicted earlier. Food accessibility may be the most problematic in only a few counties (shown by the boxes farther away from the axes).

Income and Food Access

Income is another crucial factor that can impact access to food. Low-income groups may be more sensitive to food prices and the lack of food availability in close proximity to their homes. The scatter plot charts the population with low food accessibility and populations with both low income and low food access. Here, low income is defined as “annual family income of less than or equal to 200 percent of the Federal poverty threshold based on family size” (Economic Research Service (n.d.)).

Code
sns.set(color_codes=True)
f, ax = plt.subplots(figsize=(6, 5))
sns.scatterplot(data=df, x='PCT_LACCESS_POP10', y='PCT_LACCESS_LOWI10')
ax.set_ylabel("County Population % with Low Access and Low Income", size = 12, alpha=0.9)
ax.set_xlabel("County Population % with Low Food Access",size = 12, alpha=0.9)
ax.tick_params(axis='both', which='major', labelsize=14)
plt.xlim([0, 102])
xpoints = ypoints = ax.get_xlim()
ax.plot(xpoints, ypoints, linestyle='--', color='cornflowerblue', lw=1, scalex=False, scaley=False)
f.show()

Figure 5: Low Food Access Population and Household Vehicle Availability

The plot shows that the data points always fall below the identity function line. So, the proportion of population with both low food access and low income is less than the population percentage with low food access. This suggests that not all of the population with low food access have low income (as defined for the datset). However, the plot also shows that counties with worse food access have greater percentages of people with low access and low income. Thus, income can be correlated with food accessibility. Furthermore, the variation in these percentage values increases as accessibility worsens. This is also reminiscent of the variation in household availability for counties with bad food access.

Demographic Characteristics

To explore the racial composition of counties with very low food accessibility, the dataset was filtered to only include counties where more than 80% of the population had low food accessibility. The average population percentages of different race groups were then calculated for these counties.

Table 1 shows that in counties where more than 80% of the population has low food accessibility, the majority of the population is White, followed by Hispanic and American Indian / Alaska Native.

Code
from IPython.display import Markdown
from tabulate import tabulate
# Create a separate dataframe containing county info and race categories
df_race = pd.DataFrame(df, columns=['FIPS', 'State', 'County', 'PCT_LACCESS_POP10','PCT_NHWHITE10',
                                    'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10','PCT_NHNA10',
                                    'PCT_NHPI10', 'PCT_LACCESS_POP15', 'PCT_LACCESS_WHITE15',
                                    'PCT_LACCESS_BLACK15', 'PCT_LACCESS_HISP15','PCT_LACCESS_NHASIAN15', 
                                    'PCT_LACCESS_NHNA15', 'PCT_LACCESS_NHPI15','PCT_LACCESS_MULTIR15'])

# Store avg pct values for each race here, create table using this dictionary
table = []
# all race categories as given by the dataset
race_catg = ["WHITE", "BLACK", "HISPANIC", "ASIAN", "AMER INDIAN / ALASKA NATIVE", "HAWAIIAN / PACIFIC ISLANDER", "MULTIRACIAL"]

# 2010 very low access counties 
low_access10 = df_race[df['PCT_LACCESS_POP10']>80]
# Update dictionary with 2010 mean values for very low access counties
for (mean_pct,catg) in zip(low_access10.mean()[1:7],race_catg):
  table.append([catg, round(mean_pct, 2)])

# 2015 
low_access15 = df_race[df['PCT_LACCESS_POP15']>80]
for (mean_pct,i) in zip(low_access15.mean()[8:],range(len(race_catg))):
  mean_pct = round(mean_pct, 2)
  try:
    table[i].append(mean_pct)
  except IndexError:
    table[-1].append(0)
    table[-1].append(mean_pct)  # accounting for the extra MULTIRACIAL catg for 2015

Markdown(tabulate(table, 
  headers=["Race","2010", "2015"]
))
Table 1: Racial Composition in Low Food Access Counties
Race 2010 2015
WHITE 84.93 85.05
BLACK 0.6 0.91
HISPANIC 9.26 9.92
ASIAN 0.3 0.34
AMER INDIAN / ALASKA NATIVE 3.63 5.61
HAWAIIAN / PACIFIC ISLANDER 0.03 0.06

Features’ Relationships

Next, we explore the relationship between food accessibility and two other variables in the dataset: food insecurity and the presence of food assistance programs. First, we examine if food accessibility variables are related to food insecurity. Second, we investigate if there is a relationship between the presence of food assistance programs and food accessibility levels.

Food Insecurity and Accessibility

Food-insecure households are those that “were unable, at times during the year, to provide adequate food for one or more household members because the household lacked money and other resources for food” (Economic Research Service (n.d.)). The values measuring food insecurity were given as three-year averages. In the following model, the food insecurity percent average for 2012-2014 is used as the response variable for the training set whereas the food insecurity percent average for 2015-2017 is used as the response variable for the test set. However, the dataset provides only state-level information for this feature. Therefore, the data frame used here was created by grouping the rows by state and taking the average. The final input data frame contains 51 rows with state-level information.
Response Variable is: State Food Insecurity percentage (three year average) Input Variables are the average by state of:

  • Low Food Access Population %
  • County Population % with Low Food Access and Low Income
  • Household % with Low Food Access and no Car

Training and test dataset are 2010 and 2015 data respectively

Code
# Selecting variables 
df.drop(labels=['LACCESS_POP10', 'LACCESS_POP15', 'PCH_LACCESS_POP_10_15', 'LACCESS_LOWI10', 'LACCESS_LOWI15',
                    'PCH_LACCESS_LOWI_10_15', 'LACCESS_HHNV10', 'LACCESS_HHNV15', 'PCH_LACCESS_HHNV_10_15',
                    'LACCESS_SNAP15', 'LACCESS_CHILD10', 'LACCESS_CHILD15', 'LACCESS_CHILD_10_15', 
                    'LACCESS_SENIORS10', 'LACCESS_SENIORS15', 'PCH_LACCESS_SENIORS_10_15', 'LACCESS_WHITE15', 
                    'LACCESS_BLACK15', 'LACCESS_HISP15', 'LACCESS_NHASIAN15', 'LACCESS_NHNA15', 
                    'LACCESS_NHPI15', 'LACCESS_MULTIR15'], axis=1, inplace=True)

df_avg = df.groupby(['State']).mean()

df10_cols = ['PCT_LACCESS_POP10', 'PCT_LACCESS_LOWI10', 'PCT_LACCESS_HHNV10',
             'PCT_LACCESS_CHILD10','PCT_LACCESS_SENIORS10', 'PCT_FREE_LUNCH10', 'PCT_REDUCED_LUNCH10',
             'PCT_NHWHITE10', 'PCT_NHBLACK10', 'PCT_HISP10', 'PCT_NHASIAN10', 'PCT_NHNA10', 'PCT_NHPI10',
             'PCT_65OLDER10', 'PCT_18YOUNGER10',
             'GROCPTH11', 'SUPERCPTH11', 'CONVSPTH11', 'SPECSPTH11', 
             'SNAPSPTH12', 'FFRPTH11', 'FSRPTH11', 'PC_FFRSALES07','PC_FSRSALES07', 'PCT_SNAP12', 
             'SNAP_PART_RATE11', 'PCT_NSLP12','PCT_SBP12', 'PCT_SFSP12',
             'FDPIR12','FOODINSEC_12_14','VLFOODSEC_12_14','DIRSALES_FARMS07', 'FMRKTPTH13']


df15_cols = ['PCT_LACCESS_POP15', 'PCT_LACCESS_LOWI15', 'PCT_LACCESS_HHNV15', 
             'PCT_LACCESS_SNAP15', 'PCT_LACCESS_CHILD15', 'PCT_LACCESS_SENIORS15', 'PCT_LACCESS_WHITE15',
             'PCT_LACCESS_BLACK15','PCT_LACCESS_HISP15', 'PCT_LACCESS_NHASIAN15', 'PCT_LACCESS_NHNA15', 
             'PCT_LACCESS_NHPI15', 'PCT_LACCESS_MULTIR15', 'PCT_FREE_LUNCH15', 'PCT_REDUCED_LUNCH15', 
             'GROCPTH16', 'SUPERCPTH16', 'CONVSPTH16', 'SPECSPTH16', 'SNAPSPTH17', 'FFRPTH16', 'FSRPTH16',
             'PC_FFRSALES12', 'PC_FSRSALES12', 'PCT_SNAP17','SNAP_PART_RATE16', 'PCT_NSLP17', 'PCT_SBP17', 'PCT_SFSP17',
             'FDPIR15', 'FOODINSEC_15_17','VLFOODSEC_15_17', 'DIRSALES_FARMS12', 'FMRKTPTH18']

# Foodhub, metro 13, medhincome 15 'POVRATE15' are not here
Code
# create separate dataframes for 2010 and 2015
df10 = pd.DataFrame(df_avg, columns = df10_cols)
df15 = pd.DataFrame(df_avg, columns = df15_cols)

# Standardizing dataframe
scaler = StandardScaler()
df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)
df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)

# Rename columns in both dataframes so all column names are common
df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df15.rename(columns = lambda x : str(x)[:-2], inplace =True)

# 2010 data as training data
X_train = df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER', 'VLFOODSEC_12_', 'FOODINSEC_12_'], axis=1)
y_train = df10['FOODINSEC_12_']

# 2015 as test data
X_test = df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR', 'VLFOODSEC_15_', 'FOODINSEC_15_'], axis=1)
y_test = df15['FOODINSEC_15_']

X_train = X_train.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH',
       'CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP',
       'PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)


X_test = X_test.drop(columns=['PCT_FREE_LUNCH', 'PCT_REDUCED_LUNCH', 'GROCPTH', 'SUPERCPTH',
       'CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'PCT_SNAP', 'SNAP_PART_RATE', 'PCT_NSLP', 'PCT_SBP',
       'PCT_SFSP', 'FDPIR', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)

# Linear regression model
X = sm.add_constant(X_train)
model1 = sm.OLS(y_train, X).fit()
model1.summary()
# getting test score
model2 = LinearRegression()
model2.fit(X_train,y_train)

# OUtput 
print('R-squared:', model1.rsquared)
print('Model Test R-squared:', model2.score(X_test,y_test))
print()
print("The p-values for the linear regression model are as follows:")
print('P-values:', model1.pvalues)
R-squared: 0.10953682615612048
Model Test R-squared: 0.018688402604447707

The p-values for the linear regression model are as follows:
P-values: const               1.000000
PCT_LACCESS_POP     0.025167
PCT_LACCESS_LOWI    0.023915
PCT_LACCESS_HHNV    0.592164
dtype: float64

The linear regression returns a low \(r^2\) score. Although this may be partly attributed to the small training set and the availability of only state-level food insecurity data, the extremely low \(r^2\) suggests that food accessibility is not strongly correlated with food insecurity. This means that households living far from food stores may still have good food security, and distance from food stores does not entirely dictate the level of food insecurity. Looking at the p-values, the food access population and income variables have the lowest p-values suggesting that they are the more significant variables in determining food insecurity.

Furthermore, the regression model used to predict state food insecurity for 2015 returns a lower \(r^2\), which suggests that changes in food accessibility do not fully account for changes in food insecurity. Other factors not present in the model and the dataset may be needed for determining food insecurity levels in a state.

Food Assistance Programs

The availability of food assistance programs in the county can help with improving food accessibility in the region. We use a regression model to study the relationship between food assistance accessibility and the numerous variables related to food assistance programs provided in the dataset. Our goal is to determine if these programs are available in areas with low food accessibility where the need for such programs may be the greatest.
Regression models are created separately for 2010 and 2015 to see how the availability of food assistance programs and food accessibility are related for both years. Following the previous analysis, the same state-level data frame is used for the regression models.

The response variable is the state average Low Food Access Population Percentage.

The input variables here are the state-level averages of:

  • Students eligible for free lunch (%)
  • Students eligible for reduced-price lunch (%)
  • SNAP participants (% pop)
  • SNAP benefits per capita (% change)
  • National School Lunch Program participants (%)
  • School Breakfast Program participants (% children)
  • Summer Food Service Program participants (% children)
  • FDPIR (Food Distribution Program on Indian Reservations) sites
Code
# create 2010 and 2015 dataframes for states
df10 = pd.DataFrame(df_avg, columns = df10_cols)
df15 = pd.DataFrame(df_avg, columns = df15_cols)

scaler = StandardScaler()

df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)
df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)

df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)

df15.rename(columns = lambda x : str(x)[:-2], inplace =True)
df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)


df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)
df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)

# training and test datasets according to year
x10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y10 = df10['PCT_LACCESS_POP']

x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y15 = df15['PCT_LACCESS_POP']

x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)
x15 =  x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)

#  2010 model
X = sm.add_constant(x10)
model = sm.OLS(y10, X).fit()
model.summary()
print("2010: ")
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
print()
#  2015 model
X = sm.add_constant(x15)
model = sm.OLS(y15, X).fit()
model.summary()
print("2015: ")
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
2010: 
R-squared: 0.3607467572659099
P-values: const                1.000000
PCT_FREE_LUNCH       0.840339
PCT_REDUCED_LUNCH    0.944982
PCT_SNAP             0.725729
SNAP_PART_RATE       0.321000
PCT_NSLP             0.316383
PCT_SBP              0.618867
PCT_SFSP             0.422621
FDPIR                0.061563
dtype: float64

2015: 
R-squared: 0.46831649180961343
P-values: const                1.000000
PCT_FREE_LUNCH       0.029108
PCT_REDUCED_LUNCH    0.742560
PCT_SNAP             0.031647
SNAP_PART_RATE       0.034399
PCT_NSLP             0.378130
PCT_SBP              0.069722
PCT_SFSP             0.063479
FDPIR                0.255548
dtype: float64

For both years, the \(r^2\) scores are similar and less than 0.5. This suggests that the prevalence of food assistance programs is only partly correlated with food accessibility. To check if the low \(r^2\) is due to variation within states, we train and apply the regression model at the county level.

Code
# create 2010 and 2015 dataframes for states
df10 = pd.DataFrame(df, columns = df10_cols)
df15 = pd.DataFrame(df, columns = df15_cols)

scaler = StandardScaler()
df10 = pd.DataFrame(scaler.fit_transform(df10), columns=df10.columns)
df15 = pd.DataFrame(scaler.fit_transform(df15), columns=df15.columns)

df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)

df15.rename(columns = lambda x : str(x)[:-2], inplace =True)
df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)

df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)
df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)

# training and test datasets according to year
x10 = df10.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y10 = df10['PCT_LACCESS_POP']

x15 = df15.drop(columns=['VLFOODSEC', 'FOODINSEC'], axis=1)
y15 = df15['PCT_LACCESS_POP']

x10 = x10.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)
x15 =  x15.drop(columns=['PCT_LACCESS_POP', 'PCT_LACCESS_LOWI', 'PCT_LACCESS_HHNV',
       'GROCPTH', 'SUPERCPTH','CONVSPTH', 'SPECSPTH', 'SNAPSPTH', 'FFRPTH', 'FSRPTH', 'PC_FFRSALES',
       'PC_FSRSALES', 'DIRSALES_FARMS', 'FMRKTPTH'], axis=1)

#  2010 model
X = sm.add_constant(x10)
model = sm.OLS(y10, X).fit()
model.summary()
print("2010: ")
print('R-squared:', model.rsquared)
print()
#  2015 model
X = sm.add_constant(x15)
model = sm.OLS(y15, X).fit()
model.summary()
print("2015: ")
print('R-squared:', model.rsquared)
2010: 
R-squared: 0.08965234707232639
2015: 
R-squared: 0.08183296723917333

The county-level model performs worse than the state-level model, as shown by the lower \(r^2\) scores. This suggests that food assistance programs may not be available in all counties where food access is difficult. However, the better performing state-level model suggests that people can travel within the state to a different county to access these food assistance programs. Therefore, food asssistance programs in the country may still be effective if traveling within state is not an issue.

The previous analysis showed that food insecurity is not entirely limited by accessibility. If this is the case and if food assitance programs are available in nearby counties, low food accessibility populations may still be able to maintain better food security.

Change in Food Accessibility (2010-15)

Lastly, I use a regression model to analyze the change in food accessibility across U.S. counties. The response variable is the difference between the food accessibility population percentage for 2010 and 2015, calculated by subtracting the 2010 values from the 2015 values. This is the same variable that was used earlier to generate the geographic visualization of the change in accessibility.

The predictor variables used here are listed below along with their descriptions. The variables encompass information such as the number of food stores, farm sales, expernditure on other food sources like restaurants, food insecurity levels, and access to food assistance programs.

Code
import pandas as pd
table = {
'PCT_REDUCED_LUNCH': "Students eligible for reduced-price lunch (%)",
'GROCPTH': "Grocery stores/1,000 pop",
'SUPERCPTH':"Supercenters & club stores/1,000 pop",
'CONVSPTH':"Convenience stores/1,000 pop",
'SPECSPTH':"Specialized food stores/1,000 pop",
'SNAPSPTH':"SNAP-authorized stores/1,000 pop",
'FSRPTH':"Fast-food restaurants/1,000 pop",
'PC_FFRSALES':"Expenditures per capita, fast food",
'PC_FSRSALES':"Expenditures per capita, restaurants",
'PCT_SNAP':"SNAP participants (% pop)",
'SNAP_PART_RATE':"SNAP participants (% eligible pop)",
'PCT_NSLP':"National School Lunch Program participants (% children)",
'PCT_SBP':"School Breakfast Program participants (% children)",
'PCT_SFSP':"Summer Food Service Program participants (% children)",
'FDPIR':"FDPIR Sites",
'FOODINSEC':"State Household food insecurity (%, three-year average",
'VLFOODSEC':"State Household very low food insecurity (%, three-year average",
"DIRSALES_FARMS":"Farms wth direct sales",
'FMRKTPTH':"Farmers' markets/1,000 pop"}
var = pd.DataFrame.from_dict(table, orient = 'index')
var
0
PCT_REDUCED_LUNCH Students eligible for reduced-price lunch (%)
GROCPTH Grocery stores/1,000 pop
SUPERCPTH Supercenters & club stores/1,000 pop
CONVSPTH Convenience stores/1,000 pop
SPECSPTH Specialized food stores/1,000 pop
SNAPSPTH SNAP-authorized stores/1,000 pop
FSRPTH Fast-food restaurants/1,000 pop
PC_FFRSALES Expenditures per capita, fast food
PC_FSRSALES Expenditures per capita, restaurants
PCT_SNAP SNAP participants (% pop)
SNAP_PART_RATE SNAP participants (% eligible pop)
PCT_NSLP National School Lunch Program participants (% ...
PCT_SBP School Breakfast Program participants (% child...
PCT_SFSP Summer Food Service Program participants (% ch...
FDPIR FDPIR Sites
FOODINSEC State Household food insecurity (%, three-year...
VLFOODSEC State Household very low food insecurity (%, t...
DIRSALES_FARMS Farms wth direct sales
FMRKTPTH Farmers' markets/1,000 pop

All variables are standardized before training and applying the regression model.

Note

The income and household vehicle availability variables are removed here since they are largely correlated with the food accessibility response variables.

Code
# create 2010 and 2015 dataframes for all counties
df10 = pd.DataFrame(df, columns = df10_cols)
df15 = pd.DataFrame(df, columns = df15_cols)

# Rename columns
df10.rename(columns = lambda x : str(x)[:-2], inplace=True)
df10.rename(columns={'FOODINSEC_12_': 'FOODINSEC','VLFOODSEC_12_': 'VLFOODSEC'}, inplace=True)
df15.rename(columns = lambda x : str(x)[:-2], inplace =True)
df15.rename(columns={'FOODINSEC_15_': 'FOODINSEC','VLFOODSEC_15_': 'VLFOODSEC'}, inplace=True)

# Drop columns
df10.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_NHWHITE', 'PCT_NHBLACK',
                       'PCT_HISP', 'PCT_NHASIAN', 'PCT_NHNA', 'PCT_NHPI', 
                       'PCT_65OLDER', 'PCT_18YOUNGER'], inplace=True)
df15.drop(columns=['PCT_LACCESS_CHILD','PCT_LACCESS_SENIORS','PCT_LACCESS_SNAP','PCT_LACCESS_WHITE',
             'PCT_LACCESS_BLACK','PCT_LACCESS_HISP', 'PCT_LACCESS_NHASIAN', 'PCT_LACCESS_NHNA', 
             'PCT_LACCESS_NHPI', 'PCT_LACCESS_MULTIR'], inplace=True)

# Create dataframe with differences between 2010 & 2015 values as columns
diff = pd.DataFrame()
for c in df10.columns:
    diff[c] = df15[c]-df10[c]

# Scale all columns
scaler = StandardScaler()
diff = pd.DataFrame(scaler.fit_transform(diff), columns=diff.columns)

# Drop correlating variables and create train/test data
X = diff.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)
y = diff['PCT_LACCESS_POP']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply linear regression
X = sm.add_constant(X_train)
model = sm.OLS(y_train, X).fit()
model.summary()
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
R-squared: 0.02122744699035428
P-values: const                0.716920
PCT_FREE_LUNCH       0.696828
PCT_REDUCED_LUNCH    0.857769
GROCPTH              0.211986
SUPERCPTH            0.045945
CONVSPTH             0.482173
SPECSPTH             0.427779
SNAPSPTH             0.325925
FFRPTH               0.289545
FSRPTH               0.450601
PC_FFRSALES          0.763854
PC_FSRSALES          0.441492
PCT_SNAP             0.242449
SNAP_PART_RATE       0.001517
PCT_NSLP             0.035178
PCT_SBP              0.093460
PCT_SFSP             0.058786
FDPIR                0.236457
FOODINSEC            0.245323
VLFOODSEC            0.788494
DIRSALES_FARMS       0.989146
FMRKTPTH             0.007023
dtype: float64

The model used here does not produce good \(r^2\) scores. One reason for this may be that the counties for which there are significant changes are low in number and may seem like unusual values for the dataset. This is shown by the distribution plot shown in Figure 6.

Code
# Plot distribution of difference in low food access pop% 
def plt_dis(c):
    f = sns.displot(data=diff, x=c, height=4, aspect=10/8.27, bins=20)
    plt.ylabel("Number of Counties", alpha=0.8)
    plt.xlabel("Change in Low Food Access Pop. % (standardized)", alpha=0.8)
    plt.show()
plt_dis('PCT_LACCESS_POP')

Figure 6: Distribution of Change in Low Access Population (Standardized)

Now, we create a model using data for only the counties that have significant changes in food accessibility. For this, the dataset was filtered to contain only counties for which the standardized Change in Food Accessibility was greater than 1.5 and less than -1.5.
We apply both linear regression and decision tree regression models and compare their performance.

Code
# filter data for extreme cases
df_extreme = diff[diff['PCT_LACCESS_POP'] > 1.5]
df_extreme = diff[diff['PCT_LACCESS_POP'] < -1.5]

# Drop correlated variables and create train/test data
X = df_extreme.drop(columns=['PCT_LACCESS_POP','PCT_LACCESS_LOWI','PCT_LACCESS_HHNV'], axis=1)
y = df_extreme['PCT_LACCESS_POP']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression
print("Linear Regression")
X = sm.add_constant(X_train)
model = sm.OLS(y_train, X).fit()
model.summary()
print('R-squared:', model.rsquared)
print('P-values:', model.pvalues)
print()
print()
# Decision Tree
print("Decision Tree Regression")
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_score = dt_model.score(X_test, y_test)
print("R-aquared score:", dt_score)
Linear Regression
R-squared: 0.24170625736787055
P-values: PCT_FREE_LUNCH       3.299206e-01
PCT_REDUCED_LUNCH    4.821978e-01
GROCPTH              4.878081e-01
SUPERCPTH            8.757824e-01
CONVSPTH             4.228270e-01
SPECSPTH             1.969613e-01
SNAPSPTH             1.497012e-01
FFRPTH               7.908066e-01
FSRPTH               7.414277e-01
PC_FFRSALES          9.810093e-01
PC_FSRSALES          9.459537e-01
PCT_SNAP             5.318101e-01
SNAP_PART_RATE       3.859439e-01
PCT_NSLP             5.183571e-01
PCT_SBP              4.437169e-01
PCT_SFSP             6.699933e-01
FDPIR                2.183979e-08
FOODINSEC            6.302984e-01
VLFOODSEC            6.250474e-01
DIRSALES_FARMS       4.566845e-01
FMRKTPTH             5.418631e-01
dtype: float64


Decision Tree Regression
R-aquared score: 0.4282723731862612

The \(r^2\) score for the linear regression model improved significantly when considering only counties for which food accessibility changed notably. When using a decision tree regressor, the \(r^2\) score improved even further. It is possible that these counties with notable changes in food accessibility are also the same counties with extremely low food access, as seen in Figure 3 and Figure 2. This brings us back to the two distinct groups of counties shown in Figure 1; the counties with significant changes in food access may belong to the group of counties where food access tends to be very low. Analyzing the data based on these groups may lead to better results.

The smallest p-value here is for the FDPIR variable which measures the Food Distribution Program on Indian Reservations. This variable may have been significant in the model because the extreme counties filtered here fall in regions where Native American populations are more significant.

Conlusion

This project analyzed food accessibility and other food environment factors. The analysis showed that counties with low food accessibility can be divided into two groups: those with low and very low food accessibility. The counties with very low food accessibility are located in the West and Midwest regions of the country, with populations primarily consisting of White, Hispanic, and Native American/ Alaska Native people. Additionally, the analysis also revealed a correlation between income and food accessibility, as a significant proportion of the population in counties with low food accessibility had low incomes.

Initially, all counties were found to have low food accessibility. However, examining household vehicle availability suggested that the problem may be less severe, as the majority of counties with poor food access had access to cars. There were only a few counties in which low food accessibility coincided with a lack of available vehicles.

Food accessibility remained largely remained the same from 2010 to 2015, with significant changes only occurring in counties with very low food accessibility. The regression model used here tried to look explain the changes by looking at how number of food stores/sources and assistance programs changed. However, these factors were not sufficient build an accurate model which is likely due to the absence of significant variables in the dataset.

Finally, regression models were used to assess the relationship between food insecurity and food accessibility, as well as other food environment factors. The analysis revealed that food insecurity cannot be fully explained by food accessibility alone. Additional factors such as income, poverty rates, and unemployment levels may also be significant. However, these variables could not be included in the current models due to their absence from the dataset or overlap with the response variables. Future analysis could incorporate such variables from other datasets to gain a more comprehensive understanding of food insecurity.

References

Coleman-Jensen, Alisha, Mark Nord, Margaret Andrews, and Steven Carlson. 2011. “Household Food Security in the United States in 2010.” ERR-125, U.S. Dept. Of Agriculture, Econ. Res. Serv.
Economic Research Service, U. S. Department of Agriculture (USDA). n.d. “Food Environment Atlas Documentation.”